SOLR-18220 Add support for countDist in rollup for streaming expressions #4394
KhushJain wants to merge 3 commits into apache:main
So was this a partly started feature that was never finished?
Pull request overview
Adds countDist support to streaming-expression rollups by turning CountDistinctMetric from a stub into a working metric, updating rollup docs, and extending rollup tests.
Changes:
- Implement CountDistinctMetric value tracking and simplify its emitted stream expression.
- Extend rollup and hash-rollup tests to assert distinct counts for integer/string fields.
- Document countDist(col) as a supported rollup metric and add an unreleased changelog entry.
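The value tracking named in the first change amounts to keeping a set of the values seen for a group. A minimal sketch in plain Java follows; `DistinctCounterSketch` and its method signatures are hypothetical stand-ins for illustration, not the Solr metric API, and the `long` return type is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the metric; not the Solr Metric API.
class DistinctCounterSketch {
  // Distinct non-null values seen so far for the current group.
  private final Set<Object> distinctValues = new HashSet<>();

  // Mirrors the update() step: skip nulls and let the set drop duplicates.
  public void update(Object value) {
    if (value != null) {
      distinctValues.add(value);
    }
  }

  // The metric value is the number of distinct values collected.
  public long getValue() {
    return distinctValues.size();
  }

  public static void main(String[] args) {
    DistinctCounterSketch counter = new DistinctCounterSketch();
    List<Object> values = Arrays.asList(1, 2, 2, null, 3, 1);
    for (Object v : values) {
      counter.update(v);
    }
    System.out.println(counter.getValue()); // prints 3
  }
}
```

Two properties of a set-based count are worth keeping in mind: memory grows with the number of distinct values per group, and equality follows `equals()`, so an `Integer` 1 and a `Long` 1 count as two different values.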
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| solr/solrj-streaming/src/java/org/apache/solr/client/solrj/io/stream/metrics/CountDistinctMetric.java | Implements client-side distinct counting and updates expression parsing/serialization. |
| solr/solrj-streaming/src/test/org/apache/solr/client/solrj/io/stream/StreamingTest.java | Adds programmatic rollup assertions for CountDistinctMetric. |
| solr/solrj-streaming/src/test/org/apache/solr/client/solrj/io/stream/StreamDecoratorTest.java | Adds expression-based rollup/hashRollup assertions for countDist and updates fixture data. |
| solr/solr-ref-guide/modules/query-guide/pages/stream-decorator-reference.adoc | Documents countDist(col) in rollup supported metrics and syntax example. |
| changelog/unreleased/SOLR-18220-support-countdist-in-rollup.yml | Adds the unreleased changelog entry for the feature. |
  public void update(Tuple tuple) {
-   // Nop for now
+   Object value = tuple.get(columnName);
+   if (value != null) {
+     distinctValues.add(value);
+   }
The APPROX_COUNT_DISTINCT / hll constant existed, but the code was a complete no-op. I'd argue that's a separate feature and needs its own registered function, like hll().
  if (1 != expression.getParameters().size()) {
    throw new IOException(
        String.format(Locale.ROOT, "Invalid expression %s - unknown operands found", expression));
The old constructor silently ignored the second parameter anyway, so no one could have depended on it. This PR fixes the serialization to match what the constructor actually accepts.
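The point of the reply above is that what the metric serializes should be something its own parser accepts. A minimal sketch of that round trip in plain Java follows; `CountDistExpressionSketch` and its string-based parsing are hypothetical and only mirror the single-operand check quoted in the snippet, while the real code works on Solr's StreamExpression objects.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; the real parsing goes through Solr's StreamExpression classes.
class CountDistExpressionSketch {
  private final String columnName;

  CountDistExpressionSketch(String columnName) {
    this.columnName = columnName;
  }

  // Parses "countDist(col)" and rejects anything other than exactly one operand.
  public static CountDistExpressionSketch parse(String expression) throws IOException {
    String trimmed = expression.trim();
    String prefix = "countDist(";
    if (!trimmed.startsWith(prefix) || !trimmed.endsWith(")")) {
      throw new IOException(
          String.format(Locale.ROOT, "Invalid expression %s - expected countDist(col)", expression));
    }
    String inner = trimmed.substring(prefix.length(), trimmed.length() - 1);
    List<String> operands = new ArrayList<>();
    for (String part : inner.split(",", -1)) {
      operands.add(part.trim());
    }
    if (operands.size() != 1 || operands.get(0).isEmpty()) {
      throw new IOException(
          String.format(Locale.ROOT, "Invalid expression %s - unknown operands found", expression));
    }
    return new CountDistExpressionSketch(operands.get(0));
  }

  // Serializes with the column name only, so the output parses back cleanly.
  public String toExpression() {
    return "countDist(" + columnName + ")";
  }
}
```

In this sketch `parse("countDist(a_i)").toExpression()` yields `countDist(a_i)` again, while `parse("countDist(a_i,true)")` throws, which is the mismatch the old two-parameter serialization would have run into.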
  public StreamExpressionParameter toExpression(StreamFactory factory) throws IOException {
-   return new StreamExpression(getFunctionName())
-       .withParameter(columnName)
-       .withParameter(Boolean.toString(outputLong));
+   return new StreamExpression(getFunctionName()).withParameter(columnName);
This is the SQL module's map-reduce path, not the streaming expression path. This PR targets countDist in streaming expression rollup()
The class existed and worked for facets and stats, where it was used only as an identifier that gets pushed down to the JSON Facet API to do the actual computation. So update()/getValue() was never called in that workflow.
@epugh The PR check failed on 2 existing flaky tests on CloudConsistency.
https://issues.apache.org/jira/browse/SOLR-18220
Description
Rollup function to support countDist (count distinct) statistics in /stream handler.
Solution
Implement the existing CountDistinctMetric stub in the streaming expressions framework:
- CountDistinctMetric.java: update() did nothing and getValue() returned null. Now uses HashSet<Object> to track distinct non-null values per group.
- toExpression(), which emitted a spurious outputLong parameter producing malformed expressions like countDist(a_i,true)
- StreamExpression constructor.
Tests
Updated existing tests in StreamDecoratorTest.java and StreamingTest.java:
- StreamDecoratorTest.testRollupStream: Added countDist(a_i) and countDist(a_s) to the expression and asserted.
- StreamDecoratorTest.testHashRollupStream: Same additions for hash-based rollup.
- StreamingTest.testRollupStream: Added CountDistinctMetric("a_i") and CountDistinctMetric("a_s") to the metrics array and asserted, including the null grouping field test.
Checklist
Please review the following and check all that apply:
- main branch.
- ./gradlew check.
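What the rollup tests assert, one distinct count per group, can be sketched in plain Java. Everything here is hypothetical: `RollupCountDistSketch`, the map-based tuples, and the field names `g` and `v` are illustration only, the input is assumed to be already sorted by the grouping field, and rendering a missing group value as the string "null" is a choice of this sketch rather than a statement about Solr's behavior.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a rollup that emits one distinct count per group.
class RollupCountDistSketch {
  // Assumes sortedTuples is ordered by groupField, so each group is contiguous.
  public static Map<String, Long> rollup(
      List<Map<String, Object>> sortedTuples, String groupField, String column) {
    Map<String, Long> result = new LinkedHashMap<>();
    Set<Object> distinct = new HashSet<>();
    String currentGroup = null;
    boolean started = false;
    for (Map<String, Object> tuple : sortedTuples) {
      // Sketch-only choice: a missing group value becomes the string "null".
      String group = String.valueOf(tuple.get(groupField));
      if (started && !group.equals(currentGroup)) {
        // Group boundary: emit the finished group and reset the set.
        result.put(currentGroup, (long) distinct.size());
        distinct.clear();
      }
      currentGroup = group;
      started = true;
      Object value = tuple.get(column);
      if (value != null) {
        distinct.add(value);
      }
    }
    if (started) {
      result.put(currentGroup, (long) distinct.size());
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> tuples = new ArrayList<>();
    tuples.add(Map.of("g", "a", "v", 1));
    tuples.add(Map.of("g", "a", "v", 1));
    tuples.add(Map.of("g", "a", "v", 2));
    tuples.add(Map.of("g", "b", "v", 7));
    System.out.println(rollup(tuples, "g", "v")); // prints {a=2, b=1}
  }
}
```

The set is reset at each group boundary, which is why the sketch needs sorted input; a hash-based variant would instead keep one set per group key.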